ABSTRACT
The accuracy and performance of (Q)SAR models depend significantly on the data used for training. Datasets prepared from publicly available databases contain structures belonging to different chemical classes and have a highly imbalanced ratio of actives to inactives. Currently, hundreds of structural descriptors are used in (Q)SAR studies. This abundance of descriptors raises the problem of the stability of the constructed (Q)SAR models. The methods frequently used to select a small fraction of the 'best' descriptors usually lack sufficient mathematical justification. To overcome these problems, we propose a new self-consistent classifier approach for SAR analysis. Logistic (SCLC) and extreme (SCEC) extensions of self-consistent regression (SCR) were implemented to enhance the classification capabilities of SCR. The approach was applied to develop classification models for inhibitory activity endpoints in HIV-1-related data and for toxicity endpoints, with subsequent fivefold cross-validation to estimate the models' performance. Comparison of the proposed SCLC and SCEC models with those developed using the original SCR and a support vector machine demonstrated comparable accuracy. The advantages of our approach in feature selection yield more generalizable (Q)SAR models. In particular, the crucial factors responsible for the observed value are determined unambiguously.
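The evaluation protocol mentioned above (fivefold cross-validation on data with an imbalanced actives/inactives ratio) can be illustrated with a minimal sketch. It does not reproduce SCR, SCLC or SCEC. The stratified splitting, the balanced-accuracy metric, the function names and the toy labels are all assumptions made for illustration, since the abstract does not state how folds were formed or which accuracy measure was used.

```python
import random
from collections import defaultdict


def stratified_kfold(labels, k=5, seed=0):
    """Return k lists of indices, each keeping roughly the overall class ratio."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    folds = [[] for _ in range(k)]
    for indices in by_class.values():
        rng.shuffle(indices)
        # Deal the indices of each class round-robin across the folds.
        for pos, idx in enumerate(indices):
            folds[pos % k].append(idx)
    return folds


def balanced_accuracy(y_true, y_pred):
    """Mean of sensitivity and specificity for binary labels (1 = active).

    Assumes both classes occur in y_true; otherwise a division by zero occurs.
    """
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    pos = sum(1 for t in y_true if t == 1)
    neg = len(y_true) - pos
    return 0.5 * (tp / pos + tn / neg)


# Hypothetical imbalanced dataset: 10 actives and 90 inactives (made-up labels).
labels = [1] * 10 + [0] * 90
folds = stratified_kfold(labels, k=5, seed=0)
```

Each fold would serve once as the test set while a model is trained on the other four. Balanced accuracy is used here only because plain accuracy can look high on imbalanced data when a model predicts the majority class.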
Subjects
Analytical Chemistry Techniques, Theoretical Models, Quantitative Structure-Activity Relationship, Support Vector Machine
ABSTRACT
Despite significant advances in the application of highly active antiretroviral therapy, the development of new drugs for the treatment of HIV infection remains an important task because the existing drugs do not provide a complete cure, cause serious side effects and lead to the emergence of resistance. In 2015, a consortium of American and European scientists and specialists launched a project to create the SAVI (Synthetically Accessible Virtual Inventory) library. Its 2016 version, comprising over 283 million structures of new, easily synthesizable organic molecules, each annotated with a proposed synthetic route, was generated in silico for the purpose of searching for safer and more potent pharmacological substances. We have developed an algorithm for comparing large chemical databases (DB) based on the representation of structural formulas as SMILES codes, and evaluated the possibility of detecting new antiretroviral compounds in the SAVI database. After analyzing the intersection of SAVI with the 97 million structures of the PubChem database, we found that only a small part of SAVI (~0.015%) is represented in PubChem, which indicates the significant novelty of this virtual library. However, among those shared structures, 632 compounds tested for anti-HIV activity were detected, 41 of which had the desired activity. Thus, our study demonstrated for the first time that SAVI is a promising source for the search for new anti-HIV compounds.
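The idea of comparing two databases through their SMILES codes can be sketched as a set-membership test. This is not the authors' algorithm, whose details the abstract does not give. It assumes that every structure in both databases has already been converted to a canonical SMILES string by the same toolkit, so that identical structures yield identical strings. The function names, the record layout (SMILES as the first token of each line) and the toy records are hypothetical.

```python
def load_smiles(lines):
    """Collect the SMILES string (first whitespace-separated token) of each record."""
    result = set()
    for line in lines:
        parts = line.split()
        if parts:  # skip blank lines
            result.add(parts[0])
    return result


def overlap(reference, other_db_lines):
    """Stream the other database and keep the SMILES also found in `reference`."""
    shared = set()
    for line in other_db_lines:
        parts = line.split()
        if parts and parts[0] in reference:
            shared.add(parts[0])
    return shared


# Made-up toy records, assumed to be canonicalized consistently beforehand.
savi_like = ["CCO s1", "c1ccccc1 s2", "CC(=O)O s3", ""]
pubchem_like = ["CCO p1", "CCN p2", "CC(=O)O p3"]

# Hold the smaller database in memory and stream the larger one.
reference = load_smiles(pubchem_like)
shared = overlap(reference, savi_like)
fraction = len(shared) / len(load_smiles(savi_like))
```

Keeping only the smaller database in memory and streaming the larger one line by line is one way to limit memory use. At the scale of hundreds of millions of records, a real comparison would likely need sorted files or hashing on disk rather than an in-memory Python set.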